CLAIMS 

1. A method of extracting relevant data, comprising:

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes selected data of the first document, 
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document 
including markup language; 

determining an edit sequence between at least part of the first set of data and at 
least part of the second set of data, the edit sequence including any of insertions, deletions, 
and substitutions; and

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the edit sequence. 
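
As an illustration of the steps recited in claim 1, the following is a minimal sketch, assuming each set of data has already been reduced to a list of tokens and assuming unit costs for every edit. The function names `edit_sequence` and `corresponding_indices` are hypothetical and do not come from the claims; the dynamic program is the standard Levenshtein alignment, used here only as one possible way to obtain an edit sequence.

```python
def edit_sequence(a, b):
    """Return a list of (op, i, j) steps that turn token list a into token list b."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum number of edits turning a[:i] into b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if a[i - 1] == b[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + step,  # match or substitution
                             cost[i - 1][j] + 1,         # deletion
                             cost[i][j - 1] + 1)         # insertion
    # Walk back from the end of both sequences to recover the edits.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        step = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + step:
            ops.append(("match" if step == 0 else "substitute", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return ops


def corresponding_indices(ops, selected):
    """Map selected positions in the first sequence to positions in the second."""
    return {i: j for op, i, j in ops
            if op in ("match", "substitute") and i in selected}


first = ["<html>", "<b>", "Price", "</b>", "$10", "</html>"]
second = ["<html>", "<p>", "Ad", "</p>", "<b>", "Price", "</b>", "$12", "</html>"]
found = corresponding_indices(edit_sequence(first, second), {4})
```

In this example the selected token `$10` (position 4 of the first list) is aligned with `$12` (position 7 of the second list), even though an advertisement paragraph was inserted ahead of it.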

2. The method of claim 1, wherein the edit sequence includes none of insertions,
deletions, and substitutions. 

3. The method of claim 1, wherein the edit sequence includes at least one of one or 
more insertions, one or more deletions, and one or more substitutions. 

4. The method of claim 1, wherein the edit sequence is at least partly determined by
calculating a total cost, and each of one or more of insertions, deletions, substitutions, and 
matches is associated with one or more costs. 
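
The total-cost calculation of claim 4 can be sketched as a weighted alignment in which each kind of operation carries its own cost. This is a minimal sketch under stated assumptions: the function name `min_total_cost`, the dictionary keys, and the numeric cost values are all hypothetical, and the sequences are assumed to be token lists.

```python
def min_total_cost(a, b, costs):
    """Minimum total cost of turning token list a into token list b.

    costs is a dict with keys "match", "substitute", "insert", "delete"
    (hypothetical names chosen for this sketch).
    """
    # prev[j] = cheapest way to turn the processed prefix of a into b[:j]
    prev = [j * costs["insert"] for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * costs["delete"]]
        for j, y in enumerate(b, 1):
            step = costs["match"] if x == y else costs["substitute"]
            cur.append(min(prev[j - 1] + step,           # match or substitution
                           prev[j] + costs["delete"],    # deletion
                           cur[j - 1] + costs["insert"]))  # insertion
        prev = cur
    return prev[-1]


a = ["<b>", "x", "</b>"]
b = ["<b>", "y", "</b>"]
cheap_sub = min_total_cost(a, b, {"match": 0, "substitute": 1, "insert": 1, "delete": 1})
costly_sub = min_total_cost(a, b, {"match": 0, "substitute": 3, "insert": 1, "delete": 1})
```

With a substitution cost of 1 the cheapest sequence substitutes `x` with `y`; with a substitution cost of 3 a deletion followed by an insertion becomes cheaper, which shows how the choice of costs steers the resulting edit sequence.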

5. The method of claim 4, wherein the one or more costs are at least partly set to
encourage the edit sequence to include one or more matches between at least some markup 
language from the selected data of the first document and at least some markup language 
from the second document, the markup language including text-based content and tags. 

6. The method of claim 4, wherein a first cost is associated with a first match at a first
distance from a root of a tree representation of some set of data, a second cost is associated
with a second match at a second distance from a root of a tree representation of some set of
data, the first distance is less than the second distance, and the first cost and the second
cost are set to encourage the first match more than the second match. 
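
One way to realize the depth-dependent match cost of claim 6 is to treat a match as a negative cost (a reward) whose magnitude shrinks with distance from the root. The function name `match_cost`, the parameter `base_reward`, and the particular formula are assumptions made for this sketch; the claim requires only that the nearer match be encouraged more.

```python
def match_cost(depth, base_reward=1.0):
    """Cost of a match at the given distance from the root.

    The value is negative, so a lower number means a stronger encouragement.
    Matches close to the root (small depth) receive the largest reward.
    """
    return -base_reward / (1 + depth)


near_root = match_cost(0)   # match at the root itself
deep_node = match_cost(3)   # match three levels below the root
```

Because `near_root` is lower than `deep_node`, a cost-minimizing alignment prefers to preserve structure near the root over structure deep in the tree.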



7. The method of claim 4, wherein a first cost is associated with a first insertion at a 
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

8. The method of claim 4, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

9. The method of claim 4, wherein a first cost is associated with a first substitution at
a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.

10. The method of claim 4, wherein a first cost is associated with a first text-based 
content substitution such that a first length of substituting text-based content is 
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second 
text-based content substitution more than the first text-based content substitution. 
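
The length-sensitive substitution cost of claim 10 can be sketched as a base cost plus a penalty proportional to the relative difference in length. The function name `text_substitution_cost`, its parameters, and the formula are hypothetical; the claim states only that a substitution with substantially different lengths is discouraged more than one with substantially equal lengths.

```python
def text_substitution_cost(old, new, base=1.0, length_weight=1.0):
    """Cost of replacing text `old` with text `new`.

    The penalty grows with the relative length difference, so swapping one
    price for another costs less than swapping a price for a long sentence.
    """
    longer = max(len(old), len(new), 1)  # the 1 guards against two empty strings
    return base + length_weight * abs(len(old) - len(new)) / longer


similar = text_substitution_cost("$10.99", "$12.49")
different = text_substitution_cost("$10", "Click here for a special offer")
```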

11. The method of claim 4, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of text-based content for one or more tags. 

12. The method of claim 4, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of one or more tags for text-based content. 

13. The method of claim 4, wherein a first cost is associated with preserving a first tag 
with unchanged attributes, a second cost is associated with preserving a second tag with 
one or more changed attributes, and the first cost and the second cost are set to discourage 
preserving the second tag more than preserving the first tag. 

14. The method of claim 1, wherein document data is at least partly from the first 
document.

15. The method of claim 1, wherein document data is at least partly from the second 
document.

16. The method of claim 1, wherein the second document is received if the second
document is different from the first document.

17. The method of claim 1, wherein the markup language includes at least HTML
(Hypertext Markup Language). 

18. The method of claim 1, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

19. The method of claim 1, wherein the markup language includes at least WML
(Wireless Markup Language). 

20. The method of claim 1, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

21. The method of claim 1, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

22. The method of claim 1, wherein the correspondence is at least partly found by one
or more of: determining the edit sequence, at least part of at least one of a first plurality of 
paths from a root of a tree representation of the first set of data to selected data of the tree 
representation of the first set of data, at least part of at least one of a second plurality of 
paths from a root of a tree representation of the second set of data to corresponding data of 
the tree representation of the second set of data, and one or more edit sequences between at
least one of the first plurality of paths and at least one of the second plurality of paths. 

23. The method of claim 1, wherein one or more of the first set of data and the second
set of data is represented at least partly by a tree. 

24. The method of claim 1, wherein one or more of the first set of data and the second
set of data is represented at least partly by a set of linearized tokens.
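
A set of linearized tokens, as recited in claim 24, can be sketched by flattening markup into an ordered list of tag and text tokens. The sketch below uses Python's standard `html.parser.HTMLParser`; the class name `Linearizer`, the function name `linearize`, and the `(kind, value)` token shape are assumptions made for illustration, and tag attributes are ignored for brevity.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Collects start tags, end tags, and non-empty text as a flat token list."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.tokens.append(("text", text))


def linearize(markup):
    """Return the markup as a list of (kind, value) tokens in document order."""
    parser = Linearizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens


tokens = linearize("<p>Price: <b>$10</b></p>")
```

Two documents linearized this way can be compared position by position, which is what a sequence-based edit computation needs.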

25. The method of claim 1, wherein at least the first document and the second
document represent different documents.

26. The method of claim 1, wherein the first document and the second document 
represent a same document. 

27. The method of claim 1, wherein the first document and the second document
represent different versions of a same document. 

28. A method of extracting relevant data, comprising:

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes selected data of the first document, 
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining a tree-based edit sequence between at least part of the first set of data 
and at least part of the second set of data, the tree-based edit sequence including any of 
insertions, deletions, and substitutions; and 

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the tree-based edit sequence.

29. The method of claim 28, wherein the tree-based edit sequence includes none of
insertions, deletions, and substitutions.

30. The method of claim 28, wherein the tree-based edit sequence includes at least one
of one or more insertions, one or more deletions, and one or more substitutions.

31. The method of claim 28, wherein the tree-based edit sequence is at least partly
determined by calculating a total cost, and each of one or more of insertions, deletions, 
substitutions, and matches is associated with one or more costs. 

32. The method of claim 31, wherein the one or more costs are at least partly set to
encourage the tree-based edit sequence to include one or more matches between at least
some markup language from the selected data of the first document and at least some 
markup language from the second document, the markup language including text-based 
content and tags. 

33. The method of claim 31, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match. 

34. The method of claim 31, wherein a first cost is associated with a first insertion at a 
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

35. The method of claim 31, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different. 

36. The method of claim 31, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.

37. The method of claim 31, wherein a first cost is associated with a first text-based 
content substitution such that a first length of substituting text-based content is 
substantially equal to a first length of substituted text-based content, a second cost is 
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted 
text-based content, and the first cost and the second cost are set to discourage the second 
text-based content substitution more than the first text-based content substitution. 

38. The method of claim 31, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of text-based content for one or more tags. 

39. The method of claim 31, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of one or more tags for text-based content. 

40. The method of claim 31, wherein a first cost is associated with preserving a first 
tag with unchanged attributes, a second cost is associated with preserving a second tag 
with one or more changed attributes, and the first cost and the second cost are set to 
discourage preserving the second tag more than preserving the first tag. 

41. The method of claim 28, wherein document data is at least partly from the first 
document.

42. The method of claim 28, wherein document data is at least partly from the second
document.

43. The method of claim 28, wherein the second document is received if the second 
document is different from the first document.

44. The method of claim 28, wherein the markup language includes at least HTML
(Hypertext Markup Language). 

45. The method of claim 28, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

46. The method of claim 28, wherein the markup language includes at least WML
(Wireless Markup Language). 

47. The method of claim 28, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language).

48. The method of claim 28, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

49. The method of claim 28, further comprising:

if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data 
including a larger subtree in a first tree representation of the first set of data, the larger 
subtree including the selected data; 

determining a second edit sequence between at least part of the first set of
data and at least part of a second tree representation of the second set of data, the first set
of data including at least part of the larger selected data, the second edit sequence
including any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the second edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the second edit sequence.

50. The method of claim 28, wherein the correspondence is at least partly found by one 
or more of: determining the tree-based edit sequence, at least part of at least one of a first 
plurality of paths from a root of a tree representation of the first set of data to selected data 
of the tree representation of the first set of data, at least part of at least one of a second 
plurality of paths from a root of a tree representation of the second set of data to
corresponding data of the tree representation of the second set of data, and one or more 
tree-based edit sequences between at least one of the first plurality of paths and at least one 
of the second plurality of paths.

51. The method of claim 28, wherein one or more of the first set of data and the second
set of data is represented at least partly by a tree. 

52. The method of claim 28, wherein one or more of the first set of data and the second
set of data is represented at least partly by a set of linearized tokens. 

53. The method of claim 28, wherein at least the first document and the second 
document represent different documents.

54. The method of claim 28, wherein the first document and the second document 
represent a same document. 

55. The method of claim 28, wherein the first document and the second document
represent different versions of a same document. 

56. The method of claim 28, further comprising:

determining at least one edit sequence of forward and backward edit sequences
between at least part of a first tree representation of the first set of data and at least part of
a second tree representation of the second set of data;

performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

57. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at 
least partly specifying document data; 

accessing at least a second set of data of a second document, the second document 
including markup language;

determining document data of the second set of data, by finding corresponding data 
of the second set of data, the corresponding data having a correspondence to the selected 
data of the first set of data; 

identifying the corresponding data of the second set of data as selected data of the 
second set of data, the selected data at least partly specifying document data;

accessing at least a third set of data of a third document, the third document 
including markup language; and

determining document data of the third set of data, by finding corresponding data
of the third set of data, the corresponding data having a correspondence to at least one of
the selected data of the first set of data and the selected data of the second set of data.

58. The method of claim 57, wherein subsequent sets of data of documents are
received, the documents including markup language, document data of the subsequent sets
of data are determined by finding corresponding data of the subsequent sets of data, the
corresponding data of the subsequent sets correspond to the selected data of earlier sets of
data, the corresponding data of the subsequent sets are identified as selected data of the
subsequent sets of data, the selected data of the subsequent sets of data at least partly
specifying document data, and at least one of selected data of the earlier sets and the
selected data of the subsequent data at least partly determine corresponding data of later
sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and
the later sets of data are received later than the subsequent sets of data.

59. The method of claim 57, wherein document data is at least partly from the first
document. 

60. The method of claim 57, wherein document data is at least partly from the second 
document. 

61. The method of claim 57, wherein document data is at least partly from the third
document. 

62. The method of claim 57, wherein the second document is received if the second
document is different from the first document. 

63. The method of claim 57, wherein the markup language includes at least HTML 
(Hypertext Markup Language). 

64. The method of claim 57, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

65. The method of claim 57, wherein the markup language includes at least WML
(Wireless Markup Language).

66. The method of claim 57, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

67. The method of claim 57, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

68. The method of claim 57, wherein at least two of the first document, the second
document, and the third document represent different documents.

69. The method of claim 57, wherein at least two of the first document, the second 
document, and the third document represent a same document.

70. The method of claim 57, wherein at least two of the first document, the second 
document, and the third document represent different versions of a same document.

71. A method of extraction, comprising:

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes selected data, the selected data at 
least partly specifying document data;

accessing at least a second set of data of a second document, the second document 
including markup language; 

finding one or more sets of corresponding data of the second set of data, each of 
one or more sets of corresponding data having a strength of correspondence to the selected 
data of the first set of data;

if two or more sets of corresponding data are found, then 1) if one of the
corresponding sets of data has a substantially higher strength of correspondence than
strengths of correspondence of the other corresponding sets of data, assigning a high
measure of quality to the selection of the selected data, and 2) assigning a low measure of
quality to the selection of the selected data, if at least one of: 2a) none of the corresponding
sets of data has a substantially higher strength of correspondence than strengths of
correspondence of the other corresponding sets of data, and 2b) if strengths of
correspondence of all corresponding sets of data are low.
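
The quality assignment of claim 71 can be sketched as a comparison of the best strength of correspondence against the runner-up. This is a minimal sketch under stated assumptions: strengths are taken to be numbers between 0 and 1, and the function name `selection_quality` and the thresholds `margin` and `low` are hypothetical stand-ins for "substantially higher" and "low", which the claim does not quantify. Where one candidate stands out but all strengths are low, the sketch gives precedence to condition 2b); that ordering is also an assumption.

```python
def selection_quality(strengths, margin=0.2, low=0.3):
    """Return "high" or "low" for two or more candidates, else None.

    strengths: one strength of correspondence per set of corresponding data.
    margin:    gap by which the best must beat the runner-up (assumed value).
    low:       strength below which a correspondence counts as low (assumed value).
    """
    if len(strengths) < 2:
        return None  # the claim addresses only the case of two or more candidates
    ranked = sorted(strengths, reverse=True)
    if ranked[0] < low:
        return "low"   # condition 2b): every strength is low
    if ranked[0] - ranked[1] >= margin:
        return "high"  # condition 1): one candidate clearly stands out
    return "low"       # condition 2a): no candidate stands out


clear_winner = selection_quality([0.9, 0.4, 0.3])
ambiguous = selection_quality([0.8, 0.75])
all_weak = selection_quality([0.2, 0.05])
```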

72. The method of claim 71, wherein document data is at least partly from the first
document.

73. The method of claim 71, wherein document data is at least partly from the second 
document. 

74. The method of claim 71, wherein the second document is received if the second
document is different from the first document. 

75. The method of claim 71, wherein the markup language includes at least HTML 
(Hypertext Markup Language). 

76. The method of claim 71, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

77. The method of claim 71, wherein the markup language includes at least WML
(Wireless Markup Language). 



78. The method of claim 71, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language). 

79. The method of claim 71, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content. 

80. The method of claim 71, wherein the first document and the second document
represent different documents.

81. The method of claim 71, wherein the first document and the second document
represent a same document.

82. The method of claim 71, wherein the first document and the second document
represent different versions of a same document.

83. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes a first selected subset and a second 
selected subset, such that the second selected subset of data is a subset of the first selected 
subset of data, the first selected subset at least partly specifying document data, the second
selected subset at least partly specifying document data; 

accessing at least a second set of data of a second document, the second document 
including markup language; 

determining a first edit sequence between at least part of the first set of data and at 
least part of the second set of data, the first edit sequence including any of insertions,
deletions, and substitutions; 

finding a first corresponding subset of the second set of data, the first
corresponding subset having a correspondence to the first selected subset, the 
correspondence at least partly found by determining the first edit sequence; 

determining a second edit sequence between at least part of the first set of data and 
at least part of the second set of data, the first set of data including at least part of the first
selected subset, the second set of data including at least part of the first corresponding 
subset, the second edit sequence including any of insertions, deletions, and substitutions; 
and 

finding a second corresponding subset of the second set of data, the second 
corresponding subset having a correspondence to the second selected subset, the
correspondence at least partly found by determining the second edit sequence. 

84. The method of claim 83, wherein at least one of the first edit sequence and the 
second edit sequence includes none of insertions, deletions, and substitutions. 

85. The method of claim 83, wherein at least one of the first edit sequence and the 
second edit sequence includes at least one of one or more insertions, one or more 
deletions, and one or more substitutions. 

86. The method of claim 83, wherein at least one of the first edit sequence and the
second edit sequence is at least partly determined by calculating a total cost, and each of 
one or more of insertions, deletions, substitutions, and matches is associated with one or 
more costs. 

87. The method of claim 86, wherein the one or more costs are at least partly set to
encourage the edit sequence to include one or more matches between at least some markup 
language from the selected data of the first document and at least some markup language
from the second document, the markup language including text-based content and tags. 

88. The method of claim 86, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match. 



89. The method of claim 86, wherein a first cost is associated with a first insertion at a 
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

90. The method of claim 86, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

91. The method of claim 86, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.

92. The method of claim 86, wherein a first cost is associated with a first text-based 
content substitution such that a first length of substituting text-based content is 
substantially equal to a first length of substituted text-based content, a second cost is 
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second 
text-based content substitution more than the first text-based content substitution. 

93. The method of claim 86, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage
substitutions of text-based content for one or more tags.

94. The method of claim 86, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of one or more tags for text-based content. 

95. The method of claim 86, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag 
with one or more changed attributes, and the first cost and the second cost are set to 
discourage preserving the second tag more than preserving the first tag. 

96. The method of claim 83, wherein document data is at least partly from the first
document.

97. The method of claim 83, wherein document data is at least partly from the second
document. 

98. The method of claim 83, wherein the second document is received if the second 
document is different from the first document.

99. The method of claim 83, wherein the markup language includes at least HTML
(Hypertext Markup Language).

100. The method of claim 83, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

101. The method of claim 83, wherein the markup language includes at least WML
(Wireless Markup Language). 

102. The method of claim 83, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

103. The method of claim 83, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

104. The method of claim 83, further comprising:

if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data 
including a larger subtree in a first tree representation of the first set of data, the larger 
subtree including the selected data; 

determining a third edit sequence between at least part of the first set of
data and at least part of a second tree representation of the second set of data, the first set
of data including at least part of the larger selected data, the third edit sequence including
any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the third edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the third edit sequence.

105. The method of claim 83, wherein one or more of the first set of data and the second 
set of data is represented at least partly by a tree. 

106. The method of claim 83, wherein one or more of the first set of data and the second 
set of data is represented at least partly by a set of linearized tokens.
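
By way of illustration only, the following Python sketch shows one possible way to linearize markup into tokens, determine an edit sequence between two token lists, and read off the data corresponding to a selected token. The function names (tokenize, edit_sequence, corresponding_index), the regular expression, and the unit costs are assumptions made for this example and are not taken from the claims.

```python
import re

def tokenize(markup):
    """Split markup into a flat list of tag tokens and text tokens (assumed scheme)."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", markup) if t.strip()]

def edit_sequence(a, b):
    """Return a minimum-cost list of (op, i, j) steps turning list a into list b.

    op is 'match', 'sub', 'del' or 'ins'. Unit costs for insertions,
    deletions and substitutions are an assumption of this sketch.
    """
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover one optimal sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and a[i - 1] == b[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
            ops.append(("match" if same else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    ops.reverse()
    return ops

def corresponding_index(ops, selected):
    """Index in the second list aligned with token `selected` of the first, or None."""
    for op, i, j in ops:
        if i == selected and op in ("match", "sub"):
            return j
    return None
```

In this sketch a selected text token in the first token list is followed through a match or substitution step to its counterpart in the second list; a selected token that the sequence deletes has no counterpart and yields None.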

107. The method of claim 83, wherein the first document and the second document 
represent different documents. 

108. The method of claim 83, wherein the first document and the second document
represent a same document.

109. The method of claim 83, wherein the first document and the second document
represent different versions of a same document. 

110. The method of claim 83, wherein at least one of the first edit sequence and the 
second edit sequence includes a tree-based edit sequence.

111. The method of claim 83, wherein at least one of determining the first edit sequence 
and determining the second edit sequence comprises: 

determining at least one edit sequence of forward and backward edit sequences
between at least part of a first tree representation of the first set of data and at least part of
a second tree representation of the second set of data;

performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

112. A method of extraction, comprising: 

accessing at least a plurality of first sets of data of a plurality of first documents,
the first documents including markup language, wherein each of the plurality of first sets
of data includes selected data, the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining a most corresponding first set of data of the plurality of first sets of
data, the most corresponding first set of data having most correspondence with the second
set of data, by comparing partial representations of the plurality of first sets of data with a
partial representation of the second set of data.

113. The method of claim 112, wherein document data is at least partly from one or 
more of the plurality of first documents. 

114. The method of claim 112, wherein document data is at least partly from the second
document.

115. The method of claim 112, wherein the second document is received if the second
document is different from at least one of the plurality of first documents.

116. The method of claim 112, wherein the second document is received if the second
document is different from all of the plurality of first documents.

117. The method of claim 112, wherein the markup language includes at least HTML
(Hypertext Markup Language).

118. The method of claim 112, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

119. The method of claim 112, wherein the markup language includes at least WML 
(Wireless Markup Language).

120. The method of claim 112, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language).

121. The method of claim 112, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

122. The method of claim 112, wherein the partial representation of the second set of
data includes a hash value computed on at least part of the second set of data. 

123. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data includes a hash value computed on at least part of the first
set of data of the plurality of first sets of data.

124. The method of claim 112, wherein the partial representation of the second set of
data includes at least a partial syntax tree of the second set of data.

125. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data includes at least a partial syntax tree of the first set of data
of the plurality of first sets of data.

126. The method of claim 112, wherein the partial representation of the second set of
data includes a hash value computed on at least a partial syntax tree of the second set of
data.

127. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data includes a hash value computed on at least a partial syntax
tree of the first set of data of the plurality of first sets of data.
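
By way of illustration only, the following Python sketch shows one possible partial representation of the kind recited in claims 122 to 127: a hash computed over a depth-limited tag skeleton, used to pick the stored first set of data that most corresponds to a newly received second set. The names skeleton_hash and most_corresponding, the depth limit, and the choice of SHA-256 are assumptions of this example; the claims do not prescribe any of them.

```python
import hashlib
from html.parser import HTMLParser

class _Skeleton(HTMLParser):
    """Collects 'depth:tag' entries for start tags above a depth limit."""

    def __init__(self, max_depth):
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth < self.max_depth:
            self.parts.append(f"{self.depth}:{tag}")
        self.depth += 1

    def handle_endtag(self, tag):
        # Unclosed void tags such as <br> would skew depth; ignored in this sketch.
        self.depth = max(0, self.depth - 1)

def skeleton_hash(markup, max_depth=3):
    """Hash of a partial syntax tree: tag names down to max_depth, text ignored."""
    parser = _Skeleton(max_depth)
    parser.feed(markup)
    parser.close()
    return hashlib.sha256("|".join(parser.parts).encode("utf-8")).hexdigest()

def most_corresponding(first_sets, second):
    """Return the name of the first set whose skeleton hash equals that of `second`."""
    target = skeleton_hash(second)
    for name, markup in first_sets.items():
        if skeleton_hash(markup) == target:
            return name
    return None
```

Comparing hashes gives only an equal-or-not answer, so this sketch treats "most correspondence" as an exact skeleton match; a graded comparison would need a different partial representation.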

128. The method of claim 112, wherein the partial representation of the second set of 
data includes at least one of a part of a name of the second set of data and a part of a name 
of the second document. 

129. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data of first documents includes at least one of 1) a part of a
name of the first set of data of the plurality of first sets of data and 2) a part of a name of a
first document of the first documents, the first document of the first documents including
the first set of data of the plurality of first sets of data.

130. The method of claim 112, wherein at least two documents out of the first plurality
of documents and the second document represent different documents. 

131. The method of claim 112, wherein at least two documents out of the first plurality 
of documents and the second document represent a same document.

132. The method of claim 112, wherein at least two documents out of the first plurality
of documents and the second document represent different versions of a same document.

133. A method of extraction, comprising:

accessing at least a first tree of data of a first document, the first document
including markup language, wherein the first tree of data includes selected data, the
selected data at least partly specifying document data;

accessing at least a second tree of data of a second document, the second document
including markup language;

determining at least one edit sequence of forward and backward edit sequences
between at least part of the first tree and at least part of the second tree;

performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree, the relevant
subtree at least partly determined from the forward and backward edit sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree;

2a) pruning a relevant subtree from at least part of the second tree, the relevant
subtree at least partly determined from the forward and backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

134. The method of claim 133, wherein document data is at least partly from the first
document. 

135. The method of claim 133, wherein document data is at least partly from the second 
document. 

136. The method of claim 133, wherein the second document is received if the second
document is different from the first document.

137. The method of claim 133, wherein the markup language includes at least HTML
(Hypertext Markup Language).

138. The method of claim 133, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

139. The method of claim 133, wherein the markup language includes at least WML (Wireless Markup
Language).

140. The method of claim 133, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

141. The method of claim 133, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content. 

142. The method of claim 133, wherein determining forward and backward edit
sequences, pruning a relevant subtree, and determining a pruned edit sequence are
performed for each of a plurality of subtree pairs, each of the plurality of subtree pairs
including a subtree from the first tree and a subtree from the second tree.



143. The method of claim 133, wherein the first document and the second document 
represent different documents.

144. The method of claim 133, wherein the first document and the second document
represent a same document.

145. The method of claim 133, wherein the first document and the second document
represent different versions of a same document.

146. A method of extracting relevant data, comprising:

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes selected data of the first document, 
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document 
including markup language; 

determining a first edit sequence between at least part of the first set of data and at 
least part of the second set of data, the first edit sequence including any of insertions, 
deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the first edit sequence; 

if two or more corresponding data are found, then:

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a tree representation of the first set of data, the larger subtree
including the selected data;

determining a second edit sequence between at least part of the first set of
data and at least part of the second set of data, the first set of data including at least part of
the larger selected data, the second edit sequence including any of insertions, deletions,
and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the second edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the second edit sequence.
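
By way of illustration only, the following Python sketch shows the disambiguation idea of claim 146: when the selected data has two or more candidate counterparts, a larger enclosing subtree is selected and located first, and the original selection is then found inside it. Trees are written as (label, children) tuples. For brevity, exact subtree equality stands in for the edit-sequence correspondence recited in the claim; that substitution, and the names subtree_at, find_all and locate, are assumptions of this example.

```python
def subtree_at(tree, path):
    """Follow a tuple of child indexes from the root and return that subtree."""
    for i in path:
        tree = tree[1][i]
    return tree

def find_all(tree, target, prefix=()):
    """Paths of every subtree of `tree` equal to `target` (stand-in for matching)."""
    hits = [prefix] if tree == target else []
    for i, child in enumerate(tree[1]):
        hits += find_all(child, target, prefix + (i,))
    return hits

def locate(first, second, path):
    """Path in `second` corresponding to `path` in `first`, or None.

    Starts with the selected subtree; while the match is ambiguous (or absent),
    enlarges the selection to the enclosing subtree and tries again.
    """
    path = tuple(path)
    for cut in range(len(path), -1, -1):
        hits = find_all(second, subtree_at(first, path[:cut]))
        if len(hits) == 1:
            # Unique match for the enlarged selection; descend to the original one.
            return hits[0] + path[cut:]
    return None
```

For example, a "Read more" paragraph that appears under several headings is ambiguous on its own, but the enclosing block that also contains its heading is typically unique, so the paragraph can be recovered from the position of that block.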

147. The method of claim 146, wherein document data is at least partly from the first
document.

148. The method of claim 146, wherein document data is at least partly from the second
document.

149. The method of claim 146, wherein the second document is received if the second
document is different from the first document.

150. The method of claim 146, wherein the markup language includes at least HTML 
(Hypertext Markup Language). 

151. The method of claim 146, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

152. The method of claim 146, wherein the markup language includes at least WML 
(Wireless Markup Language). 

153. The method of claim 146, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language). 

154. The method of claim 146, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

155. The method of claim 146, wherein the first document and the second document
represent different documents.

156. The method of claim 146, wherein the first document and the second document
represent a same document. 



157. The method of claim 146, wherein the first document and the second document 
represent different versions of a same document. 

158. A method of extraction, comprising:

accessing at least a first tree of data of a first document, the first document
including markup language, wherein the first tree of data includes selected data, the
selected data at least partly specifying document data;

accessing at least a second tree of data of a second document, the second document
including markup language;

performing tree traversal on at least part of the second tree, the tree traversal at
least partly guided by the selected data and by at least part of the first tree; and

if tree traversal fails due to one or more differences between at least part of the
second tree and at least part of the selected data, then:

determining an edit sequence between at least part of the second tree and at
least part of the first tree, the first tree including at least part of the selected data;

finding corresponding data for at least part of the second tree, the
corresponding data having a correspondence to at least part of the selected data, the
correspondence at least partly found by determining the edit sequence; and

continuing to perform tree traversal on at least part of the second tree, the
tree traversal at least partly guided by the corresponding data.
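
By way of illustration only, the following Python sketch shows one possible reading of claim 158: the second tree is traversed along the child indexes that lead to the selected data in the first tree, and when a step fails an edit script between the two nodes' child lists is used to find the corresponding child before traversal continues. Trees are (tag, children) tuples. The edit script here comes from difflib.SequenceMatcher, which is not guaranteed to be a minimum-cost edit sequence; that choice and the names align_child and follow are assumptions of this example.

```python
from difflib import SequenceMatcher

def child_tags(node):
    """Tag names of a node's children, in order."""
    return [child[0] for child in node[1]]

def align_child(first_node, second_node, index):
    """Map child `index` of first_node to a child of second_node via an edit script."""
    matcher = SequenceMatcher(None, child_tags(first_node), child_tags(second_node),
                              autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal" and i1 <= index < i2:
            return j1 + (index - i1)
    return None  # the child was deleted or replaced; no correspondence found

def follow(first, second, path):
    """Traverse `second` guided by `path` (child indexes into `first`)."""
    f, s = first, second
    for index in path:
        target = f[1][index]
        kids = s[1]
        if index < len(kids) and kids[index][0] == target[0]:
            j = index                      # plain traversal step succeeds
        else:
            j = align_child(f, s, index)   # traversal failed: fall back to alignment
            if j is None:
                return None
        f, s = target, kids[j]
    return s
```

The fallback runs only at the level where traversal fails, so unchanged parts of the document are handled by the cheaper index-based step.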

159. The method of claim 158, wherein for subsequent set tree traversal failures, 
determining, finding and continuing are repeated. 

160. The method of claim 158, wherein document data is at least partly from the first
document.

161. The method of claim 158, wherein document data is at least partly from the second
document.

162. The method of claim 158, wherein the second document is received if the second
document is different from the first document.

163. The method of claim 158, wherein the markup language includes at least HTML
(Hypertext Markup Language). 

164. The method of claim 158, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

165. The method of claim 158, wherein the markup language includes at least WML
(Wireless Markup Language).

166. The method of claim 158, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

167. The method of claim 158, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

168. The method of claim 158, wherein the first document and the second document 
represent different documents. 

169. The method of claim 158, wherein the first document and the second document
represent a same document.

170. The method of claim 158, wherein the first document and the second document
represent different versions of a same document. 

171. A method of extracting relevant data, comprising: 

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data of the first document,
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining an edit sequence between the first set of data and the second set of
data, the edit sequence including any of insertions, deletions, and substitutions; and

if the edit sequence fails a test, determining a tree-based edit sequence between the
first set of data and the second set of data, the edit sequence including any of insertions,
deletions, and substitutions.

172. The method of claim 171, wherein document data is at least partly from the first 
document.

173. The method of claim 171, wherein document data is at least partly from the second
document.

174. The method of claim 171, wherein the second document is received if the second
document is different from the first document.

175. The method of claim 171, wherein the markup language includes at least HTML
(Hypertext Markup Language). 

176. The method of claim 171, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

177. The method of claim 171, wherein the markup language includes at least WML
(Wireless Markup Language).

178. The method of claim 171, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

179. The method of claim 171, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

180. The method of claim 171, wherein the first document and the second document
represent different documents. 

181. The method of claim 171, wherein the first document and the second document 
represent a same document. 

182. The method of claim 171, wherein the first document and the second document 
represent different versions of a same document. 

183. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at
least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining document data of the second set of data, by finding corresponding data
of the second set of data, the corresponding data having a correspondence to the selected
data of the first set of data, the correspondence at least partly determined by a first edit
sequence between at least part of the first set of data and at least part of the second set of
data, the first edit sequence including any of insertions, deletions, and substitutions;

identifying the corresponding data of the second set of data as selected data of the
second set of data, the selected data at least partly specifying document data;

accessing at least a third set of data of a third document, the third document
including markup language; and

determining document data of the third set of data, by finding corresponding data
of the third set of data, the corresponding data having a correspondence to at least one of
the selected data of the first set of data and the selected data of the second set of data, the
correspondence at least partly determined by a second edit sequence between at least part
of the third set of data and at least one of at least part of the first set of data and at least
part of the second set of data, the second edit sequence including any of insertions,
deletions, and substitutions.

184. The method of claim 183, wherein at least one of the first edit sequence and the
second edit sequence includes none of insertions, deletions, and substitutions.

185. The method of claim 183, wherein at least one of the first edit sequence and the
second edit sequence includes at least one of one or more insertions, one or more
deletions, and one or more substitutions.

186. The method of claim 183, wherein at least one of the first edit sequence and the
second edit sequence is at least partly determined by calculating a total cost, and each of
one or more of insertions, deletions, substitutions, and matches is associated with one or
more costs.

187. The method of claim 185, wherein the one or more costs are at least partly set to
encourage the edit sequence to include one or more matches between at least some markup
language from the selected data of the first document and at least some markup language
from the second document, the markup language including text-based content and tags.

188. The method of claim 185, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match.

189. The method of claim 185, wherein a first cost is associated with a first insertion at
a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

190. The method of claim 185, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

191. The method of claim 185, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.
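
By way of illustration only, the following Python sketch shows one way costs could depend on distance from the root, as in claims 188 to 191, and how a total cost is calculated from them. Each token carries its depth. The particular formulas (a reward of 1/(1+depth) for a match, and penalties scaled the same way for insertions, deletions and substitutions) and the names match_cost, indel_cost, sub_cost and weighted_distance are assumptions of this example, chosen only so that matches nearer the root are encouraged more and edits at different depths cost different amounts.

```python
def match_cost(depth):
    """Negative cost (a reward) that is larger in magnitude nearer the root."""
    return -1.0 / (1 + depth)

def indel_cost(depth):
    """Cost of inserting or deleting a token at the given depth."""
    return 1.0 / (1 + depth)

def sub_cost(depth):
    """Cost of substituting a token at the given depth."""
    return 1.5 / (1 + depth)

def weighted_distance(a, b):
    """Minimum total cost of an edit sequence between lists of (token, depth)."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1][1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1][1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (tok_a, depth_a), (tok_b, depth_b) = a[i - 1], b[j - 1]
            step = match_cost(depth_a) if tok_a == tok_b else sub_cost(depth_a)
            d[i][j] = min(
                d[i - 1][j - 1] + step,               # match or substitution
                d[i - 1][j] + indel_cost(depth_a),    # deletion from a
                d[i][j - 1] + indel_cost(depth_b),    # insertion from b
            )
    return d[n][m]
```

Under these assumed costs a change deep in the tree disturbs the total less than a change at the root, so alignments that keep the top-level structure matched are preferred.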

192. The method of claim 185, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.
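
By way of illustration only, the following Python sketch shows a substitution cost of the kind described in claim 192, in which replacing text with text of a substantially different length is discouraged more than replacing it with text of similar length. The formula and the name text_substitution_cost are assumptions of this example.

```python
def text_substitution_cost(old, new, base=1.0):
    """Cost of substituting text `old` with text `new`.

    Equal lengths cost `base`; the cost rises towards 2 * base as the
    lengths diverge. The scaling is an assumption of this sketch.
    """
    longer = max(len(old), len(new), 1)
    return base * (1 + abs(len(old) - len(new)) / longer)
```

With this cost, a price that changes from one value to another of the same length is a cheap substitution, while a short label replaced by a long paragraph is an expensive one and is less likely to be chosen as the correspondence.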

193. The method of claim 185, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of text-based content for one or more tags. 

194. The method of claim 185, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of one or more tags for text-based content.

195. The method of claim 185, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag 
with one or more changed attributes, and the first cost and the second cost are set to 
discourage preserving the second tag more than preserving the first tag. 

196. The method of claim 183, wherein subsequent sets of data of documents are
received, the documents including markup language, document data of the subsequent sets
of data are determined by finding corresponding data of the subsequent sets of data, the
corresponding data of the subsequent sets correspond to the selected data of earlier sets of
data, the corresponding data of the subsequent sets are identified as selected data of the
subsequent sets of data, the selected data of the subsequent sets of data at least partly
specifying document data, and at least one of selected data of the earlier sets and the
selected data of the subsequent data at least partly determine corresponding data of later
sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and
the later sets of data are received later than the subsequent sets of data.

197. The method of claim 183, wherein document data is at least partly from the first 
document. 

198. The method of claim 183, wherein document data is at least partly from the second
document.

199. The method of claim 183, wherein document data is at least partly from the third
document.

200. The method of claim 183, wherein the second document is received if the second
document is different from the first document.

201. The method of claim 183, wherein the markup language includes at least HTML
(Hypertext Markup Language). 



202. The method of claim 183, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

203. The method of claim 183, wherein the markup language includes at least WML 
(Wireless Markup Language). 

204. The method of claim 183, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language). 

205. The method of claim 183, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content. 

206. The method of claim 183, further comprising:

if two or more corresponding data are found, then:

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a first tree representation of the first set of data, the larger
subtree including the selected data;

determining a third edit sequence between at least part of the first set of
data and at least part of a second tree representation of the second set of data, the first set
of data including at least part of the larger selected data, the third edit sequence including
any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the third edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the third edit sequence.

207. The method of claim 183, wherein one or more of the first set of data and the
second set of data is represented at least partly by a tree. 

208. The method of claim 183, wherein one or more of the first set of data and the
second set of data is represented at least partly by a set of linearized tokens. 

209. The method of claim 183, wherein at least two of the first document, the second 
document, and the third document represent different documents. 

210. The method of claim 183, wherein at least two of the first document, the second
document, and the third document represent a same document.

211. The method of claim 183, wherein at least two of the first document, the second
document, and the third document represent different versions of a same document.

212. The method of claim 183, wherein at least one of the first edit sequence and the
second edit sequence includes a tree-based edit sequence.

213. The method of claim 183, wherein determining the edit sequence comprises:

determining at least one edit sequence of forward and backward edit sequences
between at least part of a first tree representation of the first set of data and at least part of
a second tree representation of the second set of data;

performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

214. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first docimient including 
markup language, wherein the first set of data includes selected data, the selected data at 
1 0 least partly specifying document data; 

accessing at least a second set of data of a second document, the second document 
,3 including markup language; 

finding one or more sets of corresponding data of the second set of data, each of 
one or more sets of corresponding data having a strength of correspondence to the selected 
data of the first set of data, the strength of correspondence at least partly determined by an 
edit sequence between at least part of the second set of data and at least part of the first set 
of data, the edit sequence including any of insertions, deletions, and substitutions; 

if two or more sets of corresponding data are found, then 1) if one of the 
corresponding sets of data has a substantially higher strength of correspondence than 
strengths of correspondence of the other corresponding sets of data, assigning a high 
measure of quality to the selection of the selected data, and 2) if none of the corresponding 
sets of data has a substantially higher strength of correspondence than strengths of 
correspondence of the other corresponding sets of data, assigning a low measure of quality 
to the selection of the selected data. 
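
As an illustration only, the quality rule recited in claim 214 could be sketched as follows. The function name `selection_quality` and the `ratio` threshold are hypothetical; the claim does not fix what counts as "substantially higher", so the threshold here is an assumption.

```python
def selection_quality(strengths, ratio=1.5):
    """Return "high" or "low" for a list of correspondence strengths.

    `ratio` is an assumed threshold for "substantially higher";
    the claim language does not specify a value.
    """
    if len(strengths) < 2:
        raise ValueError("the rule applies only when two or more candidates are found")
    ranked = sorted(strengths, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # One candidate dominates when it exceeds the runner-up by the chosen ratio.
    return "high" if best > ratio * runner_up else "low"
```

Under this assumption, strengths of 0.9, 0.3 and 0.2 would yield a high measure of quality, while 0.9 and 0.85 would yield a low one, since neither candidate clearly stands out.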

215. The method of claim 214, wherein the edit sequence includes none of insertions, 
deletions, and substitutions. 

216. The method of claim 214, wherein the edit sequence includes at least one of one or 
more insertions, one or more deletions, and one or more substitutions. 

217. The method of claim 214, wherein the edit sequence is at least partly determined by 
calculating a total cost, and each of one or more of insertions, deletions, substitutions, and 
matches is associated with one or more costs. 
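
One way to picture an edit sequence determined by a total cost is the classic dynamic-programming alignment over two token lists, sketched below. This is a minimal sequence-based illustration, not the claimed method itself; the function name `edit_sequence` and the default cost values are assumptions chosen for the example.

```python
def edit_sequence(a, b, ins=1.0, dele=1.0, sub=1.5, match=0.0):
    """Return (total_cost, ops) for a minimum-cost edit sequence turning a into b."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimum cost of editing a[:i] into b[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + dele
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if a[i - 1] == b[j - 1] else sub
            cost[i][j] = min(cost[i - 1][j - 1] + diag,
                             cost[i - 1][j] + dele,
                             cost[i][j - 1] + ins)
    # Walk back from the bottom-right cell to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = a[i - 1] == b[j - 1]
            diag = match if same else sub
            if cost[i][j] == cost[i - 1][j - 1] + diag:
                ops.append(("match" if same else "substitute", a[i - 1], b[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + dele:
            ops.append(("delete", a[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops
```

With these assumed costs, two identical token lists produce a sequence consisting only of matches (the situation of claim 215), while any difference introduces at least one insertion, deletion, or substitution (claim 216).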

218. The method of claim 217, wherein the one or more costs are at least partly set to 
encourage the edit sequence to include one or more matches between at least some markup 
language from the selected data of the first document and at least some markup language 
from the second document, the markup language including text-based content and tags. 

219. The method of claim 217, wherein a first cost is associated with a first match at a 
first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second match at a second distance from a root of a tree representation of 
some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are set to encourage the first match more than the second match. 

220. The method of claim 217, wherein a first cost is associated with a first insertion at 
a first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second insertion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

221. The method of claim 217, wherein a first cost is associated with a first deletion at a 
first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second deletion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

222. The method of claim 217, wherein a first cost is associated with a first substitution 
at a first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second substitution at a second distance from a root of a tree 
representation of some set of data, the first distance is less than the second distance, and 
the first cost and the second cost are different. 
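
The depth-dependent costs of claims 219 to 222 might be sketched as simple functions of a node's distance from the root. Both function names and both formulas below are assumptions: the claims require only that a match nearer the root be encouraged more, and that insertion, deletion, and substitution costs differ with depth, without prescribing how.

```python
def match_cost(depth, reward=1.0):
    """Negative cost (a reward) whose magnitude is largest at the root.

    Assumed formula: the reward shrinks as the node gets deeper.
    """
    return -reward / (1 + depth)


def edit_cost(depth, base=1.0):
    """Cost of an insertion, deletion, or substitution at a given depth.

    Assumption: edits nearer the root are penalized more heavily,
    which makes the cost differ with depth.
    """
    return base * (1 + 1 / (1 + depth))
```

Here a match at depth 0 costs -1.0 and a match at depth 3 costs -0.25, so a cost-minimizing alignment favors the shallower match.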

223. The method of claim 217, wherein a first cost is associated with a first text-based 
content substitution such that a first length of substituting text-based content is 
substantially equal to a first length of substituted text-based content, a second cost is 
associated with a second text-based content substitution such that a second length of 
substituting text-based content is substantially different from a second length of substituted 
text-based content, and the first cost and the second cost are set to discourage the second 
text-based content substitution more than the first text-based content substitution. 
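
A length-sensitive substitution cost of the kind described in claim 223 could look like the sketch below. The name `text_substitution_cost`, the `base` and `penalty` parameters, and the use of a relative length difference are all assumptions made for illustration.

```python
def text_substitution_cost(old, new, base=1.0, penalty=2.0):
    """Cost of replacing text `old` with text `new`.

    Substitutions between texts of similar length stay near `base`;
    a large length mismatch adds up to `penalty` on top of it.
    """
    longest = max(len(old), len(new))
    if longest == 0:
        return base
    # Relative length difference in [0, 1]; 0 means equal lengths.
    diff = abs(len(old) - len(new)) / longest
    return base + penalty * diff
```

For example, replacing one six-character price with another keeps the base cost, while replacing a price with a long sentence is discouraged by a higher cost.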

224. The method of claim 217, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of text-based content for one or more tags. 

225. The method of claim 217, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of one or more tags for text-based content. 

226. The method of claim 217, wherein a first cost is associated with preserving a first 
tag with unchanged attributes, a second cost is associated with preserving a second tag 
with one or more changed attributes, and the first cost and the second cost are set to 
discourage preserving the second tag more than preserving the first tag. 
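
Claims 224 to 226 can be illustrated with a per-pair cost that distinguishes tags from text-based content. The `Token` type, the function `pair_cost`, and every numeric cost below are hypothetical; they show only one possible ordering of costs consistent with the claim language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str           # "tag" or "text"
    value: str          # tag name or text content
    attrs: tuple = ()   # sorted (name, value) pairs, used for tags only


def pair_cost(a, b, cross_kind=10.0, tag_sub=3.0, text_sub=1.0, attr_change=0.5):
    """Cost of aligning token `a` of the first document with token `b` of the second."""
    if a.kind != b.kind:
        # Substituting a tag for text, or text for a tag, is strongly discouraged.
        return cross_kind
    if a.kind == "text":
        return 0.0 if a.value == b.value else text_sub
    if a.value != b.value:
        return tag_sub
    # Same tag name: preserving it is free unless its attributes changed.
    return 0.0 if a.attrs == b.attrs else attr_change
```

With these assumed values, a preserved tag with unchanged attributes costs nothing, a preserved tag with changed attributes costs slightly more, and a tag-for-text substitution is the most expensive option.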

227. The method of claim 214, wherein document data is at least partly from the first 
document. 

228. The method of claim 214, wherein document data is at least partly from the second 
document. 

229. The method of claim 214, wherein the second document is received if the second 
document is different from the first document. 

230. The method of claim 214, wherein the markup language includes at least HTML 
(Hypertext Markup Language). 

231. The method of claim 214, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language). 

232. The method of claim 214, wherein the markup language includes at least WML 
(Wireless Markup Language). 

233. The method of claim 214, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language). 

234. The method of claim 214, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

235. The method of claim 214, further comprising: 

if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data 
including a larger subtree in a first tree representation of the first set of data, the larger 
subtree including the selected data; 

determining a second edit sequence between at least part of the first set of 
data and at least part of a second tree representation of the second set of data, the first set 
of data including at least part of the larger selected data, the second edit sequence 
including any of insertions, deletions, and substitutions; 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the larger selected data, the correspondence at least partly found by 
determining the second edit sequence; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the second edit sequence. 

236. The method of claim 214, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a tree. 

237. The method of claim 214, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a set of linearized tokens. 
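
A set of linearized tokens, as mentioned in claim 237, might be obtained by walking the markup and emitting one token per opening tag, closing tag, and run of text. The sketch below uses Python's standard `html.parser` module; the `Linearizer` class, the `linearize` helper, and the token layout are assumptions for illustration.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Collects markup as a flat list of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.tokens.append(("text", text))


def linearize(markup):
    parser = Linearizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens
```

A token list of this kind can be compared position by position, whereas a tree representation (claim 236) keeps the nesting explicit.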

238. The method of claim 214, wherein the first document and the second document 
represent different documents. 

239. The method of claim 214, wherein the first document and the second document 
represent a same document. 

240. The method of claim 214, wherein the first document and the second document 
represent different versions of a same document. 

241. The method of claim 214, wherein at least one of the first edit sequence and the 
second edit sequence includes a tree-based edit sequence. 

242. The method of claim 214, wherein determining the edit sequence comprises: 

determining at least one edit sequence of forward and backward edit sequences 
between at least part of a first tree representation of the first set of data and at least part of 
a second tree representation of the second set of data; 
performing at least one of 1) and 2): 
1a) pruning a relevant subtree from at least part of the first tree representation, 
the relevant subtree at least partly determined from the forward and backward edit 
sequences; 

1b) determining a pruned edit sequence between the pruned relevant subtree 
and at least part of the second tree representation; 

2a) pruning a relevant subtree from at least part of the second tree 
representation, the relevant subtree at least partly determined from the forward and 
backward edit sequences; 

2b) determining a pruned edit sequence between at least part of the first tree 
representation and the pruned relevant subtree; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the pruned edit sequence. 

243. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes a first selected subset and a second 
selected subset, such that the second selected subset of data is a subset of the first selected 
subset of data, the first selected subset at least partly specifying document data, the second 
selected subset at least partly specifying document data; 

accessing at least a second set of data of a second document, the second document 
including markup language; 

finding a first corresponding subset of the second set of data, the first 
corresponding subset having a correspondence to the first selected subset; and 

finding a second corresponding subset of the second set of data, the second 
corresponding subset having a correspondence to the second selected subset. 

244. The method of claim 243, wherein document data is at least partly from the first 
document. 

245. The method of claim 243, wherein document data is at least partly from the second 
document. 

246. The method of claim 243, wherein the second document is received if the second 
document is different from the first document. 

247. The method of claim 243, wherein the markup language includes at least HTML 
(Hypertext Markup Language). 

248. The method of claim 243, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language). 

249. The method of claim 243, wherein the markup language includes at least WML 
(Wireless Markup Language). 

250. The method of claim 243, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language). 

251. The method of claim 243, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

252. The method of claim 243, further comprising: 

if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data 
including a larger subtree in a first tree representation of the first set of data, the larger 
subtree including the selected data; 

determining a first edit sequence between at least part of the first set of data 
and at least part of a second tree representation of the second set of data, the first set of 
data including at least part of the larger selected data, the first edit sequence including any 
of insertions, deletions, and substitutions; 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the larger selected data, the correspondence at least partly found by 
determining the first edit sequence; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the first edit sequence. 

253. The method of claim 243, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a tree. 

254. The method of claim 243, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a set of linearized tokens. 

255. The method of claim 243, wherein the first document and the second document 
represent different documents. 

256. The method of claim 243, wherein the first document and the second document 
represent a same document. 

257. The method of claim 243, wherein the first document and the second document 
represent different versions of a same document. 

258. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes selected data, the selected data at 
least partly specifying document data; 

accessing at least a second set of data of a second document, the second document 
including markup language; 

determining document data of the second set of data, by finding corresponding data 
of the second set of data, the corresponding data having a correspondence to the selected 
data of the first set of data, the correspondence at least partly determined by a first 
tree-based edit sequence between at least part of the first set of data and at least part of the 
second set of data, the first tree-based edit sequence including any of insertions, deletions, 
and substitutions; 

identifying the corresponding data of the second set of data as selected data of the 
second set of data, the selected data at least partly specifying document data; 

accessing at least a third set of data of a third document, the third document 
including markup language; and 

determining document data of the third set of data, by finding corresponding data 
of the third set of data, the corresponding data having a correspondence to at least one of 
the selected data of the first set of data and the selected data of the second set of data, the 
correspondence at least partly determined by a second tree-based edit sequence between at 
least part of the third set of data and at least one of at least part of the first set of data and 
at least part of the second set of data, the second tree-based edit sequence including any of 
insertions, deletions, and substitutions. 
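
The chaining recited in claim 258, where data found in the second document becomes the selection used against a third document, can be pictured with the simplified sketch below. It is not the claimed tree-based edit sequence: it works on linearized token lists and uses Python's standard `difflib.SequenceMatcher` as a stand-in alignment. The helper name `corresponding_span` and the sample documents are hypothetical.

```python
from difflib import SequenceMatcher


def corresponding_span(a, b, start, end):
    """Map the selected slice a[start:end] onto b via aligned (equal) blocks.

    Returns (lo, hi) such that b[lo:hi] covers the tokens aligned with
    the selection, or None if nothing in the selection could be aligned.
    """
    lo = hi = None
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            continue
        # Overlap of this matched block with the selection, in a's coordinates.
        s, e = max(i1, start), min(i2, end)
        if s < e:
            offset = j1 - i1
            lo = s + offset if lo is None else min(lo, s + offset)
            hi = e + offset if hi is None else max(hi, e + offset)
    return None if lo is None else (lo, hi)


doc1 = ["<h1>", "News", "</h1>", "<p>", "Rain", "</p>"]
doc2 = ["<div>", "Ad", "</div>", "<h1>", "News", "</h1>", "<p>", "Sun", "</p>"]
doc3 = ["<h1>", "News", "</h1>", "<hr>", "<p>", "Fog", "</p>"]

span2 = corresponding_span(doc1, doc2, 3, 6)   # selection in doc1: <p> Rain </p>
span3 = corresponding_span(doc2, doc3, *span2)  # reuse doc2's span as the new selection
```

In this example the paragraph selected in the first document is located in the second document despite the inserted advertisement, and the span found there is then used to locate the paragraph in the third document.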

259. The method of claim 258, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence includes none of insertions, deletions, and 
substitutions. 

260. The method of claim 258, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence includes at least one of one or more insertions, 
one or more deletions, and one or more substitutions. 

261. The method of claim 258, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence is at least partly determined by calculating a total 
cost, and each of one or more of insertions, deletions, substitutions, and matches is 
associated with one or more costs. 

262. The method of claim 261, wherein the one or more costs are at least partly set to 
encourage the tree-based edit sequence to include one or more matches between at least 
some markup language from the selected data of the first document and at least some 
markup language from the second document, the markup language including text-based 
content and tags. 

263. The method of claim 261, wherein a first cost is associated with a first match at a 
first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second match at a second distance from a root of a tree representation of 
some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are set to encourage the first match more than the second match. 

264. The method of claim 261, wherein a first cost is associated with a first insertion at 
a first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second insertion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

265. The method of claim 261, wherein a first cost is associated with a first deletion at a 
first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second deletion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

266. The method of claim 261, wherein a first cost is associated with a first substitution 
at a first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second substitution at a second distance from a root of a tree 
representation of some set of data, the first distance is less than the second distance, and 
the first cost and the second cost are different. 

267. The method of claim 261, wherein a first cost is associated with a first text-based 
content substitution such that a first length of substituting text-based content is 
substantially equal to a first length of substituted text-based content, a second cost is 
associated with a second text-based content substitution such that a second length of 
substituting text-based content is substantially different from a second length of substituted 
text-based content, and the first cost and the second cost are set to discourage the second 
text-based content substitution more than the first text-based content substitution. 

268. The method of claim 261, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of text-based content for one or more tags. 

269. The method of claim 261, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of one or more tags for text-based content. 



Attorney Docket No. 25961-704 

270. The method of claim 261, wherein a first cost is associated with preserving a first 
tag with unchanged attributes, a second cost is associated with preserving a second tag 
with one or more changed attributes, and the first cost and the second cost are set to 
discourage preserving the second tag more than preserving the first tag. 

271. The method of claim 258, wherein subsequent sets of data of documents are 
received, the documents including markup language, document data of the subsequent sets 
of data are determined by finding corresponding data of the subsequent sets of data, the 
corresponding data of the subsequent sets correspond to the selected data of earlier sets of 
data, the corresponding data of the subsequent sets are identified as selected data of the 
subsequent sets of data, the selected data of the subsequent sets of data at least partly 
specifying document data, and at least one of selected data of the earlier sets and the 
selected data of the subsequent data at least partly determine corresponding data of later 
sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and 
the later sets of data are received later than the subsequent sets of data. 

272. The method of claim 258, wherein document data is at least partly from the first 
document. 

273. The method of claim 258, wherein document data is at least partly from the second 
document. 

274. The method of claim 258, wherein document data is at least partly from the third 
document. 

275. The method of claim 258, wherein the second document is received if the second 
document is different from the first document. 

276. The method of claim 258, wherein the markup language includes at least HTML 
(Hypertext Markup Language). 

277. The method of claim 258, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language). 

278. The method of claim 258, wherein the markup language includes at least WML 
(Wireless Markup Language). 

279. The method of claim 258, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language). 

280. The method of claim 258, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

281. The method of claim 258, further comprising: 

if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data 
including a larger subtree in a first tree representation of the first set of data, the larger 
subtree including the selected data; 

determining a third tree-based edit sequence between at least part of the 
first set of data and at least part of a second tree representation of the second set of data, 
the first set of data including at least part of the larger selected data, the third tree-based 
edit sequence including any of insertions, deletions, and substitutions; 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the larger selected data, the correspondence at least partly found by 
determining the third tree-based edit sequence; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the third tree-based edit sequence. 

282. The method of claim 258, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a tree. 

283. The method of claim 258, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a set of linearized tokens. 

284. The method of claim 258, wherein at least two of the first document, the second 
document, and the third document represent different documents. 

285. The method of claim 258, wherein at least two of the first document, the second 
document, and the third document represent a same document. 

286. The method of claim 258, wherein at least two of the first document, the second 
document, and the third document represent different versions of a same document. 

287. The method of claim 258, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence includes a tree-based edit sequence. 

288. The method of claim 258, wherein determining the tree-based edit sequence 
comprises: 

determining at least one tree-based edit sequence of forward and backward edit 
sequences between at least part of a first tree representation of the first set of data and at 
least part of a second tree representation of the second set of data; 

performing at least one of 1) and 2): 

1a) pruning a relevant subtree from at least part of the first tree representation, 
the relevant subtree at least partly determined from the forward and backward edit 
sequences; 

1b) determining a pruned tree-based edit sequence between the pruned relevant 
subtree and at least part of the second tree representation; 

2a) pruning a relevant subtree from at least part of the second tree 
representation, the relevant subtree at least partly determined from the forward and 
backward edit sequences; 

2b) determining a pruned tree-based edit sequence between at least part of the 
first tree representation and the pruned relevant subtree; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the pruned tree-based edit sequence. 

289. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including 
markup language, wherein the first set of data includes selected data, the selected data at 
least partly specifying document data; 

accessing at least a second set of data of a second document, the second document 
including markup language; 

finding one or more sets of corresponding data of the second set of data, each of 
one or more sets of corresponding data having a strength of correspondence to the selected 
data of the first set of data, the strength of correspondence at least partly determined by 
some tree-based edit sequence between at least part of the second set of data and at least 
part of the first set of data, the tree-based edit sequence including any of insertions, 
deletions, and substitutions; 

if two or more sets of corresponding data are found, then 1) if one of the 
corresponding sets of data has a substantially higher strength of correspondence than 
strengths of correspondence of the other corresponding sets of data, assigning a high 
measure of quality to the selection of the selected data, and 2) if none of the corresponding 
sets of data has a substantially higher strength of correspondence than strengths of 
correspondence of the other corresponding sets of data, assigning a low measure of quality 
to the selection of the selected data. 

290. The method of claim 289, wherein the tree-based edit sequence includes none of 
insertions, deletions, and substitutions. 

291. The method of claim 289, wherein the tree-based edit sequence includes at least 
one of one or more insertions, one or more deletions, and one or more substitutions. 

292. The method of claim 289, wherein the tree-based edit sequence is at least partly 
determined by calculating a total cost, and each of one or more of insertions, deletions, 
substitutions, and matches is associated with one or more costs. 

293. The method of claim 292, wherein the one or more costs are at least partly set to 
encourage the tree-based edit sequence to include one or more matches between at least 
some markup language from the selected data of the first document and at least some 
markup language from the second document, the markup language including text-based 
content and tags. 

294. The method of claim 292, wherein a first cost is associated with a first match at a 
first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second match at a second distance from a root of a tree representation of 
some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are set to encourage the first match more than the second match. 

295. The method of claim 292, wherein a first cost is associated with a first insertion at 
a first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second insertion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

296. The method of claim 292, wherein a first cost is associated with a first deletion at a 
first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second deletion at a second distance from a root of a tree representation 
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

297. The method of claim 292, wherein a first cost is associated with a first substitution 
at a first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second substitution at a second distance from a root of a tree 
representation of some set of data, the first distance is less than the second distance, and 
the first cost and the second cost are different. 

298. The method of claim 292, wherein a first cost is associated with a first text-based 
content substitution such that a first length of substituting text-based content is 
substantially equal to a first length of substituted text-based content, a second cost is 
associated with a second text-based content substitution such that a second length of 
substituting text-based content is substantially different from a second length of substituted 
text-based content, and the first cost and the second cost are set to discourage the second 
text-based content substitution more than the first text-based content substitution. 

299. The method of claim 292, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of text-based content for one or more tags. 

300. The method of claim 292, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of one or more tags for text-based content. 

301. The method of claim 292, wherein a first cost is associated with preserving a first 
tag with unchanged attributes, a second cost is associated with preserving a second tag 
with one or more changed attributes, and the first cost and the second cost are set to 
discourage preserving the second tag more than preserving the first tag. 

302. The method of claim 289, wherein document data is at least partly from the first 
document. 

303. The method of claim 289, wherein document data is at least partly from the second 
document. 

304. The method of claim 289, wherein the second document is received if the second 
document is different from the first document. 

305. The method of claim 289, wherein the markup language includes at least HTML 
(Hypertext Markup Language). 

306. The method of claim 289, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (Extensible Markup Language). 

307. The method of claim 289, wherein the markup language includes at least WML 
(Wireless Markup Language). 

308. The method of claim 289, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language). 

309. The method of claim 289, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

310. The method of claim 289, further comprising: 
if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data 
including a larger subtree in a first tree representation of the first set of data, the larger 
subtree including the selected data; 

determining a second tree-based edit sequence between at least part of the 
first set of data and at least part of a second tree representation of the second set of data, 
the first set of data including at least part of the larger selected data, the second tree-based 
edit sequence including any of insertions, deletions, and substitutions; 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the larger selected data, the correspondence at least partly found by 
determining the second tree-based edit sequence; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the second tree-based edit sequence. 
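
The widening step of claim 310 can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `Node` class, the `distance` function (a simple positional label comparison used as a stand-in for a tree-based edit sequence cost), and the helper names are all hypothetical. When several nodes of the second tree tie as best match, the selection is enlarged to its parent subtree, and the selected data is then located inside the unique match by following the same child positions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    label: str  # tag name, or the text itself for a text node
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self


def walk(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from walk(child)


def distance(a: Node, b: Node) -> int:
    """Stand-in for an edit cost: positional label mismatches plus
    the difference in child counts at each level."""
    d = 0 if a.label == b.label else 1
    for ca, cb in zip(a.children, b.children):
        d += distance(ca, cb)
    return d + abs(len(a.children) - len(b.children))


def best_matches(selected: Node, root2: Node) -> List[Node]:
    scored = [(distance(selected, n), n) for n in walk(root2)]
    best = min(score for score, _ in scored)
    return [n for score, n in scored if score == best]


def find_corresponding(selected: Node, root2: Node) -> Optional[Node]:
    path: List[int] = []  # child indexes from the larger subtree to the selection
    larger = selected
    while True:
        candidates = best_matches(larger, root2)
        if len(candidates) == 1:
            node = candidates[0]
            for index in path:  # descend to the counterpart of the selection
                if index >= len(node.children):
                    return None
                node = node.children[index]
            return node
        if larger.parent is None:
            return None  # still ambiguous at the root
        siblings = larger.parent.children
        path.insert(0, next(i for i, c in enumerate(siblings) if c is larger))
        larger = larger.parent  # select the larger subtree and retry


def row(name: str, value: str) -> Node:
    return Node("tr", [Node("td", [Node(name)]), Node("td", [Node(value)])])


first = Node("table", [row("Shipping", "5"), row("Price", "10")])
second = Node("table", [row("Shipping", "5"), row("Price", "12")])
selected = first.children[1].children[1]  # the cell holding "10"
found = find_corresponding(selected, second)
```

In this example the selected cell alone ties with every cell of the second table, so the sketch enlarges the selection to the enclosing row, which matches only the "Price" row; `found` is then the cell holding "12".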

311. The method of claim 289, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a tree. 

312. The method of claim 289, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a set of linearized tokens. 
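
For illustration, a linearized-token representation and an edit sequence over it might look like the sketch below. It assumes Python's standard `html.parser` module for tokenizing and uses unit costs for insertions, deletions, and substitutions; the names `Linearizer`, `linearize`, and `edit_sequence` are hypothetical, and the claims do not prescribe this particular dynamic-programming formulation.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class Linearizer(HTMLParser):
    """Flattens markup into a list of tag and text tokens."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(text)


def linearize(markup: str) -> List[str]:
    parser = Linearizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens


Op = Tuple[str, Optional[int], Optional[int]]


def edit_sequence(a: List[str], b: List[str]) -> List[Op]:
    """Minimum-cost edit sequence from a to b with unit costs."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the operations.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and a[i - 1] == b[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
            ops.append(("match" if same else "substitute", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return ops


tokens1 = linearize("<tr><td>Price</td><td>10</td></tr>")
tokens2 = linearize("<tr><td>Price</td><td>12</td></tr>")
ops = edit_sequence(tokens1, tokens2)
# Map each aligned token of the first document to its counterpart.
aligned = {i: j for kind, i, j in ops if kind in ("match", "substitute")}
```

Here the selected token `tokens1[5]` (the text "10") aligns with `tokens2[aligned[5]]`, which is the corresponding data in the second document.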

313. The method of claim 289, wherein the first document and the second document 
represent different documents. 

314. The method of claim 289, wherein the first document and the second document 
represent a same document. 

315. The method of claim 289, wherein the first document and the second document 
represent different versions of a same document. 

316. The method of claim 289, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence includes a tree-based edit sequence. 

317. The method of claim 289, wherein determining the tree-based edit sequence 
comprises: 

determining at least one tree-based edit sequence of forward and backward edit 
sequences between at least part of a first tree representation of the first set of data and at 
least part of a second tree representation of the second set of data; 

performing at least one of 1) and 2): 

1a) pruning a relevant subtree from at least part of the first tree representation, 
the relevant subtree at least partly determined from the forward and backward edit 
sequences; 

1b) determining a pruned tree-based edit sequence between the pruned relevant 
subtree and at least part of the second tree representation; 

2a) pruning a relevant subtree from at least part of the second tree 
representation, the relevant subtree at least partly determined from the forward and 
backward edit sequences; 

2b) determining a pruned tree-based edit sequence between at least part of the 
first tree representation and the pruned relevant subtree; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the pruned tree-based edit sequence. 